feat(job-distributor): add exp. backoff retry to feeds.SyncNodeInfo()
#15752
Conversation
AER Report: CI Core ran successfully ✅
AER Report: Operator UI CI ran successfully ✅
Force-pushed from 301972c to 83b1842.
Hmm, I wonder: would this solve the connection issue? If there is a communication issue between the node and JD, how would the auto sync help resolve it? It will try and it will fail, right? Alternatively, would it be better to have some kind of exponential backoff retry when the sync does fail instead? (Not that it would solve a permanent connection issue.)
Force-pushed from 57b55cc to c5d0079.
Force-pushed from a1a4281 to 61297ab.
As discussed earlier today, I went ahead and implemented your suggestion. I ran a few manual tests and it seems to work as expected, though I had to add some extra logic around the […]. I still feel the background goroutine would be more resilient. But, on the other hand, this option does not require any runtime configuration -- I think we can safely hardcode the retry parameters -- which is a huge plus to me.
Thanks @gustavogama-cll. Yeah, the background goroutine definitely has its pros, and both approaches are valid; for me, though, the retry is simpler.
Force-pushed from 61297ab to 5c30694.
Force-pushed from e540377 to b2f8386.
Force-pushed from b2f8386 to 4ff90cc.
Flakeguard Summary
Ran new or updated tests. Found flaky tests ❌ (2 results).
Artifacts: for detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.
Force-pushed from 4ff90cc to cd86fbb.
Force-pushed from cd86fbb to 8fa7efc.
Flakeguard Summary
Ran new or updated tests. Found flaky tests ❌ (2 results).
Artifacts: for detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.
Force-pushed from 8fa7efc to 6abb8aa.
looks good
There’s a behavior that we’ve observed for some time on the NOP side: they will add or update a chain configuration on the Job Distributor panel, but the change is not reflected on the service itself. This leads to inefficiencies, as NOPs are unaware of this and thus need to be notified so that they may "reapply" the configuration.

After some investigation, we suspect that this is due to connectivity issues between the nodes and the job distributor instance, which cause the message with the update to be lost.

This PR attempts to solve this by adding a "retry" wrapper on top of the existing `SyncNodeInfo` method. We rely on `avast/retry-go` to implement the bulk of the retry logic. It's configured with a minimum delay of 10 seconds, a maximum delay of 30 minutes, and a total of 56 retries -- which adds up to a bit more than 24 hours.

Ticket Number: DPA-1371
Force-pushed from 6abb8aa to a42860e.